SuperCD - Wilf's Workshop

WilfÆs ProgrammersÆ Workshop

Continued from the magazine...

Suppose you have a group of records in memory in sorted sequence. If you change the value of one of the items, you may well be destroying the sorted sequence and there will have to be modifications to bring it back into order. What steps are needed? How many comparisons and location swaps are needed? How much extra memory is required for storing vectors?

These are all crucial questions û and depending on the sorting method you use, by no means trivial. Even with the highly efficient quick sort (using binary trees) you can run into some very sticky business. Changing a value so that it affects sorted sequence means cutting the changed item from its current location in the tree (which involves rewiring the vectors of the items immediately below it in the tree) and then making a series of comparisons to find its new logical place.

Unfortunately, even simply adding an item into a quick sort tree can take considerable time if, for example, the tree is skewed û and that can happen when the input occurs in roughly sorted or reverse order: a quick sort works best when data starts out in random order.

A heap is different: the number of comparisons and swaps needed to change or add an item varies only a little: even with a thousand items in the heap, only ten comparisons (at most) need to be done. Can this fact be exploited? Can a heap help us speed up a sort that we suspect will be slow?

How a heap sort works

A heap sort is exactly what it sounds like û a heap construction, followed by a sort. LetÆs take an example. We have a list of items we want in sorted sequence, smallest to greatest. Suppose we use the following numbers, input as a stream in the following unsorted sequence: 20, 111, 184, 226, 204, 79, 125. Using the HeapAdd procedure with the greatest value coming to the top, we will first of all create a heap.

Using the HeapAdd procedure repeatedly (which calls HeapSettle) we first build a top-heavy heap of all the items to be sorted.

This only provides immediate access to the largest number in the heap, which we know will always rise to the top (item 1). We have made the heap û the second half of the task is to turn it into a sorted array. This is easier than you may think.

We take three further steps, and repeat them until the heap is æemptyÆ:

1: Swap the first and last items in the heap; this puts the highest value at the end of the heap.

2: Reduce the size of the heap by one item û effectively leaving off the recently moved highest item.

3: Settle the new first item within the heap using the HeapSettle procedure.

One by one we move the items from back to front, consuming the heap as we go.

Calculating the advantage

Now think about how many operations this will take. Adding each item into the heap will take an average of (log₂n)/2 operations, comparing and swapping, where n is the number of items in the heap. For example log₂128 is seven (because 27 is 128), and that tells us how many levels there are.

A large item may have to be swapped upward through all seven levels, but the average will be about half that. Of course the size of the heap increases all the time, and so log₂n increases with time. However if we take the maximum value of n we will be fairly safe, because that value of log₂n will be accurate for half the ælifeÆ of the heap. Our estimate will be a little high, but in the right ball park.

The second part of the process û reducing the heap while building the sorted array in reverse order û is a little more cumbersome, because HeapSettle will be used to move item 1 downward in the heap. This process requires two comparisons for every level instead of one. But again, each swap/reduce/settle cycle will take an average of (log₂n)/2 operations.

Taking together these two halves, we can see that a heap sort of n items takes about log₂n operations, each consisting of 1.5 comparisons (the average of one for going upward, two for going downward) and one swap. There is not much difference whether the input starts in sorted, reverse-sorted, or random order.

How does this compare with the quick sort, which we have used often in the past? The average number of comparisons you have to make (from the root downwards) to find the proper sorted location for the nth item is log₂n. Even ignoring the time to do a swap of two items, the heap sort takes 50 per cent more time doing comparisons for each operation û and since the number of operations is the same on average, this makes the quick sort a clear winner.

But the story doesnÆt end there. Suppose the items occur in sorted sequence already û or close to sorted sequence. In this unfortunate case the number of layers is no longer log₂n, but a lot closer to n itself, with the tree looking more like a long chain. With 100 items this would mean that a quick sort would take more than nine times as long as a heap sort. With 1000 items the heap sort can outstrip the quick sort by a factor of 60.

What can we say from all this? A heap sort is an excellent means for sorting a large number of items which will be presented already well-sorted (or in æclumpsÆ that would make quick sort slower). This also goes for items that will be presented in more or less the opposite of sorted order. Note that a heap sort works fine with randomly presented data as well, but just not as fast as a quick sort would.

At the start of the first cycle of a heap sort, the first and last items in the heap are swapped, and the heap shortened.

The end of the first cycle sees the heap adjusted with a new ætopÆ item with the highest value.

The second cycle proceeds exactly as the first cycle does: the first and last items are swapped and the heap shortened.

The second cycle ends with a smaller heap, and two sorted items immediately following the heapÆs end.

After several cycles the heap is gone, and in its place all the items appear in sequential order.

Here on the SuperCD are simple Visual Basic routines for heap maintenance and doing a heap sort within a simple two-dimensional array. The first item in each entry of the array is the sequence (or sort) key: though we have been dealing with numbers, alphabetic sequences work just as well. The second item in the array is a vector: it can point to the full record, or perhaps it is a merge file number detailing from which file the record was read, and therefore which file has to be read again.

I like the ætrickÆ of putting the current size of the heap in the sequence key of the zero item of the heap array. You may instead like to have it as a global variable, or a passed variable. Using the zero item for passing vital information about the array itself strikes me as a simple and elegant way of incorporating a bit of object orientation: the array becomes the heap object itself, and item 0,1 becomes its size attribute.

We will be using heaps in various projects in the future û they are extremely useful, and surprisingly easy to manage.

Visual Basic program code examples

Examples of SUBs that will maintain a heap, and the method of using them to perform a heap sort, are demonstrated in PROGCODE.TXT. Feel free to adapt and improve them for your own purposes, and convert them into other computer languages.

To view the text file.

Contents | Top | Help